Abstract
Here we will talk about supervised vs. unsupervised manifold learning and introduce Multi-OMICs Factor Analysis (MOFA) as an example of unsupervised data exploration. Further, we will perform integrative OMICs analysis of chromatin accessibility, DNA methylation and transcriptome profiling data from the scNMT-seq (single-cell nucleosome, methylation and transcription sequencing) using MOFA.
To compare supervised and unsupervise analyses is like comparing The Beatles and The Velvet Underground music. Both are good, just good in different ways. While supervised model is dominating in the modern research since it provides guaranteed but slow progress, the unsupervised strategy has always been considered as a “high risk high gain” approach.
MOFA is a typical hypothesis-free data exploration framework, https://www.embopress.org/doi/10.15252/msb.20178124. It allows data integration by extracting common axes of variation from the multiple OMICs layers. Given several data matrices with measurements of multiple ‘omics data types on the same or on overlapping sets of samples, MOFA infers a low-dimensional data representation in terms of (hidden) factors which can be then visualized and interpreted.
Importantly, MOFA disentangles whether the underlying axes of heterogeneity are unique to a single data modality or are manifested in multiple modalities, thereby identifying links between the different ‘omics. Once trained, the model output can be used for a range of downstream analyses, including
How does a typical MOFA output look like? In the MOFA paper they applied this method to Chronic Lymphocytic Leukaemia (CLL) on 200 human patients that combined 1) drug response, 2) somatic mutations (targeted sequencing), 3) RNAseq data, and 4) Methylation array data.
Notably, nearly 40% of the 200 samples were profiled with some but not all OMICs types; such a missing value scenario is not uncommon in large cohort studies, and MOFA is designed to cope with it.
One can inspect loadings of MOFA hidden factors and observe known bio-markers fo CLL. Specifically, factor 1 was aligned with the somatic mutation status of the immunoglobulin heavy-chain variable region gene (IGHV), while Factor 2 aligned with trisomy of chromosome 12.
Loadings from each factor can be used for pathway enrichment analysis. Interestingly, markers from diffeerent OMICs co-varying in a certain MOFA factor can be shown to be coupled to the same biological pathway. For CLL data, factor 5 demonstrates co-variation between drug response and mRNA levels. This factor tagged a set of genes that were highly enriched for oxidative stress and senescence pathways. Consistent with the annotation based on the mRNA view, it is observed that the drugs with the strongest weights on Factor 5 were associated with response to oxidative stress, such as target reactive oxygen species, DNA damage response and apoptosis.
Next MOFA can perform efficient imputation of missing values including imputation of “missing patients”, i.e. when a patient is present in one OMICs layer but absent in another, this absent layer can be reconstructed for the patient, i.e. gray bars in the panel a) of the figure above can be filled.
Finally, latent factors inferred by MOFA are predictive of clinical outcomes. The authors of MOFa paper assessed the prediction performance when combining the 10 MOFA factors in a multivariate Cox regression model (assessed using cross-validation). Notably, this model yielded higher prediction accuracy than the 10 factors derived from conventional PCA or using the individual features.
Important to realize that despite the results of PCA and Factor Analysis visually look very similar, as they both provide linear variance-covariance matrix decomposition, they have quite a different math behind. While PCA is a pure matrix factorization technique which splits the total variance into orthogonal Principal Components (PCs), Factor Analysis seeks to construct hidden latent variables that generate the observed data, therefore Factor Analysis is a generative model.
In this section we will apply MOFA to the single cell multi-OMICs data set scNMT which profiles chromatine accessebility (scATACseq), DNA methylation (scBSseq) and gene expression (scRNAseq) information from the same biological cell belonging to two types: differentiating mouse embryonic stem cells (ESCs) and embrionic bodies (EBs). We will start with reading and having a look at the individual OMICs data sets:
scRNAseq<-read.delim("scRNAseq.txt", header = TRUE, check.names = FALSE, row.names = 1, sep="\t")
ens2genesymbol<-read.delim("ENSEMBLE_TO_GENE_SYMBOL_MOUSE.txt")
ens2genesymbol<-ens2genesymbol[match(colnames(scRNAseq),as.character(ens2genesymbol$ensembl_gene_id)),]
colnames(scRNAseq)<-ens2genesymbol$external_gene_name
scRNAseq<-as.data.frame(t(scRNAseq))
scRNAseq[1:5,1:5]
## ESC_A02 ESC_A03 ESC_A04 ESC_A05 ESC_A06
## Mrpl15 557.4201 130.8536 305.7356 417.1656 573.6504
## Lypla1 410.5958 138.8325 186.3237 106.0895 662.7008
## Tcea1 437.9119 356.3899 276.9121 161.8315 213.3068
## Atp6v1h 345.7199 500.5417 195.5884 133.0614 325.1376
## Oprk1 0.0000 0.0000 0.0000 0.0000 0.0000
scBSseq<-read.delim("scBSseq.txt", header = TRUE, check.names = FALSE, row.names = 1, sep="\t")
scBSseq<-as.data.frame(t(scBSseq))
scBSseq[1:5,1:5]
## ESC_A02 ESC_A03 ESC_A04 ESC_A05 ESC_A06
## 1_101606896 0.00000 0 0 0 0
## 1_101935054 88.88889 100 100 100 30
## 1_101935061 88.88889 100 100 100 90
## 1_102002184 100.00000 100 0 0 0
## 1_102238204 100.00000 100 100 100 100
scATACseq<-read.delim("scATACseq.txt", header = TRUE, check.names = FALSE, row.names = 1, sep="\t")
scATACseq<-as.data.frame(t(scATACseq))
scATACseq[1:5,1:5]
## ESC_A02 ESC_A03 ESC_A04 ESC_A05 ESC_A06
## 1_100392139 100 100 100 100 67
## 1_100392151 100 100 100 50 67
## 1_100668590 50 86 44 45 28
## 1_100738967 83 88 83 83 88
## 1_100994324 0 0 11 75 0
For scRNAseq OMIC layer we will only select highly expressed genes in order to remove the noisy genes which might contaminate the further downstream analysis. We will also perform log-transform of the data which can be seen as a mild normalization:
scRNAseq<-scRNAseq[rowMeans(scRNAseq)>=1,]
scRNAseq<-log10(scRNAseq + 1)
scRNAseq[1:5,1:5]
## ESC_A02 ESC_A03 ESC_A04 ESC_A05 ESC_A06
## Mrpl15 2.746961 2.120092 2.486764 2.621348 2.759404
## Lypla1 2.614471 2.145608 2.272593 2.029747 2.821972
## Tcea1 2.642377 2.553142 2.443907 2.211738 2.331036
## Atp6v1h 2.539979 2.700307 2.293558 2.127304 2.513401
## Oprk1 0.000000 0.000000 0.000000 0.000000 0.000000
dim(scRNAseq)
## [1] 12145 113
Since scBSseq and scATACseq OMIC layers should be almost a binary (methylated vs unmethylated for scBSseq and open vs. close for scATACseq) data sets, a good way to get rid of redundancy in the scBSseq and scATACseq data and thus perform feature pre-selection and reduce dimensions is to select only features with high variability:
library("mixOmics")
## Loading required package: MASS
## Loading required package: lattice
## Loading required package: ggplot2
##
## Loaded mixOmics 6.3.1
##
## Visit http://www.mixOmics.org for more details about our methods.
## Any bug reports or comments? Notify us at mixomics at math.univ-toulouse.fr or https://bitbucket.org/klecao/package-mixomics/issues
##
## Thank you for using mixOmics!
my_nearZeroVar<-nearZeroVar(as.data.frame(t(scBSseq)))
head(my_nearZeroVar$Metrics)
## freqRatio percentUnique
## 1_101606896 112.00 1.7699115
## 1_105525627 35.00 5.3097345
## 1_105539120 26.25 5.3097345
## 1_10605572 0.00 0.8849558
## 1_107485472 0.00 0.8849558
## 1_109269874 20.00 7.9646018
dim(my_nearZeroVar$Metrics)
## [1] 3290 2
scBSseq <- scBSseq[-which(rownames(scBSseq)%in%rownames(my_nearZeroVar$Metrics)),]
scBSseq[1:5,1:5]
## ESC_A02 ESC_A03 ESC_A04 ESC_A05 ESC_A06
## 1_101935054 88.88889 100 100 100 30
## 1_101935061 88.88889 100 100 100 90
## 1_102002184 100.00000 100 0 0 0
## 1_102238204 100.00000 100 100 100 100
## 1_102255678 100.00000 100 100 100 0
dim(scBSseq)
## [1] 5285 113
my_nearZeroVar<-nearZeroVar(as.data.frame(t(scATACseq)), uniqueCut = 1)
head(my_nearZeroVar$Metrics)
## freqRatio percentUnique
## 1_10401731 0 0.8849558
## 1_128314944 0 0.8849558
## 1_13628215 0 0.8849558
## 1_178804829 0 0.8849558
## 1_183332775 0 0.8849558
## 1_183944178 0 0.8849558
dim(my_nearZeroVar$Metrics)
## [1] 91 2
scATACseq <- scATACseq[-which(rownames(scATACseq)%in%rownames(my_nearZeroVar$Metrics)),]
scATACseq[1:5,1:5]
## ESC_A02 ESC_A03 ESC_A04 ESC_A05 ESC_A06
## 1_100392139 100 100 100 100 67
## 1_100392151 100 100 100 50 67
## 1_100668590 50 86 44 45 28
## 1_100738967 83 88 83 83 88
## 1_100994324 0 0 11 75 0
dim(scATACseq)
## [1] 11709 113
Let us now have a look at the histograms of individual OMICs layers in order to decide which distribution they follow and how we should model these distributions with MOFA:
hist(rowMeans(scRNAseq),breaks=100, main = "scRNAseq")
hist(rowMeans(scBSseq), breaks = 100, main = "scBSseq")
hist(rowMeans(scATACseq), breaks = 100, main = "scATACseq")
We conclude that while scRNAseq data look fairly Gaussian, we should probably model the scBSseq and scATACseq data as following Bernoulli distribution as they look quite bimodal indicating the binary nature of the data, i.e. methylated vs. unmethylated for scBSseq and open vs. close for scATACseq. We will further make the scBSseq and scATACseq data sets purely binary by encoding values below 50 as 0 and above 50 as 1, we need to remove low-variance features in this case again:
scBSseq<-ifelse(scBSseq<50,0,1)
scBSseq[1:5,1:5]
## ESC_A02 ESC_A03 ESC_A04 ESC_A05 ESC_A06
## 1_101935054 1 1 1 1 0
## 1_101935061 1 1 1 1 1
## 1_102002184 1 1 0 0 0
## 1_102238204 1 1 1 1 1
## 1_102255678 1 1 1 1 0
my_nearZeroVar<-nearZeroVar(as.data.frame(t(scBSseq)))
head(my_nearZeroVar$Metrics)
## freqRatio percentUnique
## 1_110547534 21.60 1.7699115
## 1_133172191 27.25 1.7699115
## 1_139117169 27.25 1.7699115
## 1_146093078 21.60 1.7699115
## 1_194260338 112.00 1.7699115
## 1_24611678 0.00 0.8849558
dim(my_nearZeroVar$Metrics)
## [1] 761 2
scBSseq <- as.data.frame(scBSseq[-which(rownames(scBSseq)%in%rownames(my_nearZeroVar$Metrics)),])
scBSseq[1:5,1:5]
## ESC_A02 ESC_A03 ESC_A04 ESC_A05 ESC_A06
## 1_101935054 1 1 1 1 0
## 1_101935061 1 1 1 1 1
## 1_102002184 1 1 0 0 0
## 1_102238204 1 1 1 1 1
## 1_102255678 1 1 1 1 0
dim(scBSseq)
## [1] 4524 113
scATACseq<-ifelse(scATACseq<50,0,1)
my_nearZeroVar<-nearZeroVar(as.data.frame(t(scATACseq)))
head(my_nearZeroVar$Metrics)
## freqRatio percentUnique
## 1_100392139 112.00 1.769912
## 1_100392151 27.25 1.769912
## 1_100738967 55.50 1.769912
## 1_101897864 55.50 1.769912
## 1_102283335 55.50 1.769912
## 1_102563492 112.00 1.769912
dim(my_nearZeroVar$Metrics)
## [1] 2238 2
scATACseq <- as.data.frame(scATACseq[-which(rownames(scATACseq)%in%rownames(my_nearZeroVar$Metrics)),])
scATACseq[1:5,1:5]
## ESC_A02 ESC_A03 ESC_A04 ESC_A05 ESC_A06
## 1_100668590 1 1 0 0 0
## 1_100994324 0 0 0 1 0
## 1_100994328 0 0 0 0 0
## 1_100994336 0 0 0 0 0
## 1_100994343 0 0 0 0 1
dim(scATACseq)
## [1] 9471 113
Let us continue with creating MOFA object:
library("MOFA")
##
## Attaching package: 'MOFA'
## The following object is masked from 'package:stats':
##
## predict
omics<-list(scRNAseq=scRNAseq,scBSseq=scBSseq,scATACseq=scATACseq)
lapply(omics,dim)
## $scRNAseq
## [1] 12145 113
##
## $scBSseq
## [1] 4524 113
##
## $scATACseq
## [1] 9471 113
MOFAobject <- createMOFAobject(omics)
## Creating MOFA object from list of matrices,
## please make sure that samples are columns and features are rows...
plotDataOverview(MOFAobject)
#DEFINE DATA OPTIONS
DataOptions <- getDefaultDataOptions()
DataOptions
## $scaleViews
## [1] FALSE
##
## $removeIncompleteSamples
## [1] FALSE
#DEFINE MODEL OPTIONS
ModelOptions <- getDefaultModelOptions(MOFAobject)
mydistr<-c("gaussian","bernoulli","bernoulli")
names(mydistr)<-c("scRNAseq","scBSseq","scATACseq")
ModelOptions$likelihood<-mydistr
ModelOptions$numFactors<-20
ModelOptions
## $likelihood
## scRNAseq scBSseq scATACseq
## "gaussian" "bernoulli" "bernoulli"
##
## $numFactors
## [1] 20
##
## $sparsity
## [1] TRUE
#DEFINE TRAIN OPTIONS
TrainOptions <- getDefaultTrainOptions()
TrainOptions$seed <- 2018
# Automatically drop factors that explain less than 1% of variance in all omics
TrainOptions$DropFactorThreshold <- 0.03
TrainOptions$tolerance <- 0.1
TrainOptions$maxiter<-1000
TrainOptions$verbose<-TRUE
TrainOptions
## $maxiter
## [1] 1000
##
## $tolerance
## [1] 0.1
##
## $DropFactorThreshold
## [1] 0.03
##
## $verbose
## [1] TRUE
##
## $seed
## [1] 2018
Let us run MOFA:
MOFAobject <- prepareMOFA(MOFAobject, DataOptions = DataOptions,
ModelOptions = ModelOptions, TrainOptions = TrainOptions)
## Checking data options...
## Checking training options...
## Checking model options...
MOFAobject <- runMOFA(MOFAobject)
## [1] "No output file provided, using a temporary file..."
Let us check the covariance of the OMICs layers:
#ANALYZE RESULTS OF MOFA INTEGRATION
r2 <- calculateVarianceExplained(MOFAobject)
r2$R2Total
## scATACseq scBSseq scRNAseq
## 0.04628196 0.05484257 0.13215154
head(r2$R2PerFactor)
## scATACseq scBSseq scRNAseq
## LF1 0.0458775605 1.482963e-03 0.06899144
## LF2 0.0003573541 5.366988e-02 0.02739404
## LF3 0.0001293258 7.427593e-05 0.03848415
plotVarianceExplained(MOFAobject)
We can see that scRNAseq provides the largest variation (13%) in the total integrative OMICs scNMT data set, scBSseq and scATACseq contribute around 5% of variation. We also observe that MOFA selected 3 Latent Factors (LFs), scRNAseq contributes to all of them while second LF is driven by scBSseq and scATACseq contributes only to the first LF, scRNAseq covaries with scATACseq in the first LF while scRNAseq covary with scBSseq in the second LF. Now let us interpret the LFs by displaying the features associated with each LF:
NumFactors<-dim(getFactors(MOFAobject))[2]
plotWeights(MOFAobject, view = "scRNAseq", factor = 1)
plotTopWeights(object = MOFAobject, view = "scRNAseq", factor = 1, nfeatures = 10)
plotDataHeatmap(object = MOFAobject, view = "scRNAseq", factor = "LF1", features = 10, transpose = TRUE, show_colnames=TRUE, show_rownames=TRUE,cluster_cols = TRUE)
plotWeights(MOFAobject, view = "scBSseq", factor = 2)
plotTopWeights(object = MOFAobject, view = "scBSseq", factor = 2, nfeatures = 10)
plotDataHeatmap(object = MOFAobject, view = "scBSseq", factor = "LF2", features = 10, transpose = TRUE, show_colnames=TRUE, show_rownames=TRUE,cluster_cols = TRUE)
plotWeights(MOFAobject, view = "scATACseq", factor = 1)
plotTopWeights(object = MOFAobject, view = "scATACseq", factor = 1, nfeatures = 10)
plotDataHeatmap(object = MOFAobject, view = "scATACseq", factor = "LF1", features = 10, transpose = TRUE, show_colnames=TRUE, show_rownames=TRUE,cluster_cols = TRUE)
Let us also display the cells in the low-dimensional latent space:
controls<-c("EB_P1D12","EB_P1E12","EB_P1F12","EB_P1G12","EB_P2B12","EB_P2D12","EB_P2E12","EB_P2F12","EB_P2G12",
"ESC_B12","ESC_C12","ESC_D12")
colnames(scRNAseq)<-ifelse(colnames(scRNAseq)%in%controls,"CONTROL_CELL",colnames(scRNAseq))
cell_types<-matrix(unlist(strsplit(colnames(scRNAseq),"_")),ncol=2,byrow=TRUE)[,1]
plotFactorScatter(MOFAobject, factors = 1:2, color_by = as.factor(cell_types))
We conclude that the integrative OMICs approach can distinguish between all three types of cells: ESCs, EBs and Controls.